Title	Description	Notes
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures	Original paper	2018
Why Use Self-Attention?	Brief analysis by 机器之心 (Synced)

Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Gongbo Tang, Mathias Müller, Annette Rios, Rico Sennrich
(Submitted on 27 Aug 2018 (v1), last revised 28 Aug 2018 (this version, v2))

Recently, non-recurrent architectures (convolutional, self-attentional) have outperformed RNNs in neural machine translation. CNNs and self-attentional networks can connect distant words via shorter network paths than RNNs, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument has not been tested empirically, nor have alternative explanations for their strong performance been explored in-depth. We hypothesize that the strong performance of CNNs and self-attentional networks could also be due to their ability to extract semantic features from the source text, and we evaluate RNNs, CNNs and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.

Comments: 10 pages, 5 figures, accepted by EMNLP 2018 (v2: corrected author names)
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1808.08946 [cs.CL]
(or arXiv:1808.08946v2 [cs.CL] for this version)

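The main contributions of this paper are as follows:

It tests the theoretical claim that architectures with shorter network paths are better at capturing long-range dependencies. The experiments on modeling long-range subject-verb agreement do not show that Transformers or CNNs outperform RNNs in this respect.

It shows empirically that the number of attention heads in a Transformer affects its ability to capture long-range dependencies. Specifically, multi-head attention is necessary for modeling long-range dependencies with self-attention.

It shows empirically that Transformers excel at word sense disambiguation (WSD), which indicates that they are strong semantic feature extractors.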
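
The "shorter network path" argument that the first contribution puts to the test can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the CNN formula assumes a plain stack of non-dilated convolutions with a fixed kernel size, where each layer widens the receptive field by `kernel_size - 1` positions.

```python
import math


def rnn_path_length(distance: int) -> int:
    """Recurrent steps needed to carry information across `distance` positions."""
    return distance


def cnn_path_length(distance: int, kernel_size: int) -> int:
    """Stacked conv layers needed before one unit sees both tokens.

    Assumption (not from the paper): non-dilated convolutions, so the
    receptive field after L layers spans 1 + L * (kernel_size - 1) positions.
    """
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    return math.ceil(distance / (kernel_size - 1))


def self_attention_path_length(distance: int) -> int:
    """A single self-attention layer connects any two distinct positions directly."""
    return 1 if distance > 0 else 0


if __name__ == "__main__":
    # Compare path lengths for a few subject-verb distances.
    for d in (2, 10, 16):
        print(d, rnn_path_length(d), cnn_path_length(d, 3), self_attention_path_length(d))
```

Under these assumptions the path length grows linearly with distance for an RNN, shrinks by a factor of `kernel_size - 1` for a CNN, and stays constant for self-attention. The paper's finding is that this ordering alone does not predict which architecture handles long-distance subject-verb agreement best.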